© 2000 Oxford University Press    Nucleic Acids Research, 2000, Vol. 28, No. 1 263-266



The Pfam Protein Families Database 

Alex Bateman*, Ewan Birney, Richard Durbin, Sean R. Eddy1, Kevin L. Howe and
Erik L. L. Sonnhammer2

The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK,
1Department of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA and
2Center for Genomics Research, Karolinska Institutet, S-171 77 Stockholm, Sweden

Received October 1, 1999; Accepted October 4, 1999



ABSTRACT 

Pfam is a large collection of protein multiple
sequence alignments and profile hidden Markov
models. Pfam is available on the WWW in the UK at
http://www.sanger.ac.uk/Software/Pfam/ , in Sweden
at http://www.cgr.ki.se/Pfam/ and in the US at
http://pfam.wustl.edu/ . The latest version (4.3) of Pfam
contains 1815 families. These Pfam families match
63% of proteins in SWISS-PROT 37 and TrEMBL 9.
For complete genomes Pfam currently matches up to
half of the proteins. Genomic DNA can be directly
searched against the Pfam library using the Wise2
package.

INTRODUCTION 

Pfam is a database of protein domain families. Pfam contains
curated multiple sequence alignments for each family, as well
as profile hidden Markov models (profile HMMs) for finding
these domains in new sequences. Pfam contains functional
annotation, literature references and database links for each
family. There are two multiple alignments for each Pfam
family: the seed alignment, which contains a relatively small
number of representative members of the family, and the full
alignment, which contains all members in the database that can be
detected. All alignments use sequences taken from pfamseq,
which is a non-redundant protein set composed of SWISS-PROT
and SP-TrEMBL. The profile HMM is built from the seed alignment
using the HMMER package (see http://hmmer.wustl.edu/ )
and is then used to search the pfamseq sequence database.
All the matches found above the curated thresholds are aligned
using the profile HMM to make the full alignment. The largest
full alignment in Pfam, for the HIV GP120 glycoprotein, has
>16 000 members, yet the seed alignment has only 24
representative members. The latest version of Pfam (4.3) contains
1815 families that have matches to 63% of sequences, covering
45% of residues in the sequence database.
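
The two coverage figures quoted above (fraction of sequences with at least one match, and fraction of residues inside a match) are different quantities. A minimal sketch of how they could be computed from a list of domain hits follows; the function name, the input layout and the toy data are assumptions made for illustration and are not part of the Pfam software.

```python
def coverage(seq_lengths, hits):
    """Return (sequence coverage, residue coverage) as fractions.

    seq_lengths: dict mapping sequence id -> length in residues
    hits: iterable of (sequence id, start, end), 1-based inclusive
    Both input layouts are hypothetical, chosen for this sketch.
    """
    covered = {}
    for seq_id, start, end in hits:
        # A set of positions means overlapping hits are counted once.
        covered.setdefault(seq_id, set()).update(range(start, end + 1))
    seqs_hit = len(covered)
    residues_hit = sum(len(positions) for positions in covered.values())
    return (seqs_hit / len(seq_lengths),
            residues_hit / sum(seq_lengths.values()))


# Toy data: three sequences, two of which carry domain hits.
toy_lengths = {"A": 100, "B": 50, "C": 50}
toy_hits = [("A", 1, 40), ("A", 31, 60), ("B", 11, 20)]
seq_cov, res_cov = coverage(toy_lengths, toy_hits)
```

In the toy data two of three sequences are hit, while only 70 of 200 residues are covered, which is why residue coverage is always the lower of the two numbers.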

One of the main goals of Pfam was to aid the annotation of
the Caenorhabditis elegans genome (1). Traditional approaches to
large scale sequence annotation use a pairwise sequence
comparison method such as BLAST (2) to find similarity to
proteins of known function. Annotations are then transferred
from the protein of known function to the predicted protein.
The pairwise similarity search does not give a clear indication
of the domain structure of the proteins. Mistakes in annotation
can result from not considering the domain organisation of
proteins (3). For example, a protein may be misannotated as an
enzyme when the similarity is only to a regulatory domain.
Since its inception, Pfam has been developed to provide broad
support for automated protein sequence classification and
annotation. During the last year there have been significant
changes and extensions to Pfam, which further this role.

Pfam WEBSITES 

There are currently three Pfam websites that are maintained
independently. All of the sites contain core functionality,
including searching the Pfam library of HMMs, searching the
text annotation of Pfam and viewing the multiple alignments
for each family. A few new features are not yet implemented
on all sites.

The Pfam WWW servers can present the domain architecture of
a protein graphically as 'beads on a string' with a colour-coded
and hyperlinked bead for each domain (4). To get an overview
of the different domains involved in a family, it is possible to
list graphical schematics for all family members in one view.
By browsing the sequence annotation together with these
schematics, one can get a rough idea of the evolution and
functional implications of domain combinations. For instance, if a
certain combination is uniquely associated with proteins of a
distinct functional class, this would suggest that other proteins
with this combination have the same function. Likewise, if a
certain combination is present in a certain taxonomic group
only, it may confer a function that is specific for those organisms.
If a combination is found scattered over a range of taxa, this
might suggest that it arose multiple times independently.
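
The kind of browsing described above amounts to grouping proteins by their ordered domain combination and then looking at which taxa each combination occurs in. A small sketch of that grouping is given below; the record layout, the function name and the example identifiers are hypothetical, and the Pfam servers present this information graphically rather than through any such function.

```python
from collections import defaultdict


def summarise_architectures(proteins):
    """Group proteins by domain architecture.

    proteins: iterable of (protein id, taxon, [domain names, N- to C-terminal]).
    Returns a dict: architecture tuple -> (sorted protein ids, sorted taxa).
    """
    groups = defaultdict(lambda: {"proteins": [], "taxa": set()})
    for protein_id, taxon, domains in proteins:
        architecture = tuple(domains)
        groups[architecture]["proteins"].append(protein_id)
        groups[architecture]["taxa"].add(taxon)
    return {arch: (sorted(g["proteins"]), sorted(g["taxa"]))
            for arch, g in groups.items()}


# Hypothetical records, for illustration only.
toy_proteins = [
    ("prot1", "Bacteria", ["domA", "domB"]),
    ("prot2", "Bacteria", ["domA", "domB"]),
    ("prot3", "Eukaryota", ["domA"]),
]
summary = summarise_architectures(toy_proteins)
```

A combination whose taxa list has a single entry is a candidate for a group-specific function; one that recurs across distant taxa would need a tree-based analysis before concluding independent origin.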

For a more fine-grained analysis of the evolution of domain
architectures, we have developed a novel tool that displays the
graphical domain schematics of each sequence connected in an
evolutionary tree. This tool is implemented as a Java applet,
NIFAS, which at present is available from the Pfam servers in
Sweden and the UK. It requires Netscape 4.5 or Internet
Explorer 4.0. An example of a NIFAS view is shown in Figure 1.
Trees are calculated from Pfam seed or full multiple alignments.
We are currently using the neighbour-joining tree construction
method in Clustalw (5). NIFAS can be used to analyse whether
two or more domains have co-evolved or have recombined
recently. For instance, the bacterial sugar transferase proteins
PTF1_RHOCA (P23388) and PTF1_XANCP (P45597) are
clustered together in Figure 1, in which the tree was calculated
for the enzymatic domain. The NIFAS view based on the
EIIA 2 domain shows the same two sequences grouped, and
based on the HPR domain they are grouped too, although not
as reliably (data not shown). This analysis thus suggests that an
ancestral protein existed with all three domains, and the two
present proteins are its direct descendants.

Pfam-A is supplemented by Pfam-B; however, it has not previously
been possible to annotate new proteins with matches to
Pfam-B families. Protein sequence submitted to the UK Pfam
search server is now automatically searched for Pfam-B
domains (as well as the standard search for Pfam-A domains).
This is performed by using BLAST2 to search against a database
of the sequence fragments that form Pfam-B, with some
post-processing of the results. Sequence segments matching a Pfam-B
family can then be aligned against the family using a profile
HMM. These profile HMMs are built on-the-fly; profile
HMMs for Pfam-B families are not currently part of the Pfam
distribution.

A further enhancement of Pfam's utility is the addition of
structural information to alignments with members of known
3D-structure. Secondary structure and relative solvent accessibility
values extracted from the DSSP database (6) are included as
alignment markups (labels '#=GR .. SS' and '#=GR .. SA') as
of Pfam 4.3. Furthermore, the corresponding entries in the
PDB database (7) are referenced with residue coordinates.
These references are linked to rasmol (8) for visualisation of
the structural entity that corresponds to the Pfam domain.
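
As an illustration of the per-residue markup lines mentioned above, the sketch below collects '#=GR' lines from Stockholm-format text, where each such line has the form "#=GR <sequence name> <feature> <annotation>". The function name and the example alignment are invented for this sketch; the sequence, the SS string and the SA string are not taken from any real Pfam entry.

```python
def parse_gr_markup(stockholm_text):
    """Collect per-residue '#=GR' markup from Stockholm-format text.

    Returns a dict keyed by (sequence name, feature), e.g. ("seq1/1-10", "SS").
    Annotation split over several alignment blocks is concatenated.
    """
    markup = {}
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GR"):
            continue
        parts = line.split(None, 3)  # tag, sequence name, feature, annotation
        if len(parts) != 4:
            continue  # skip malformed lines rather than guessing
        _, seq_name, feature, annotation = parts
        key = (seq_name, feature)
        markup[key] = markup.get(key, "") + annotation.strip()
    return markup


# Invented example alignment with one sequence and two markup lines.
EXAMPLE = """# STOCKHOLM 1.0
seq1/1-10          MKVLAAGIVG
#=GR seq1/1-10 SS  CCHHHHHHCC
#=GR seq1/1-10 SA  9840012369
//
"""
gr = parse_gr_markup(EXAMPLE)
```

Each markup string has the same number of columns as the aligned sequence it annotates, so position i of the SS string describes the residue in column i.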

CHANGES TO Pfam-B 

Pfam-B is an automatically generated supplement to Pfam-A
that provides completeness in terms of coverage. Pfam-B has
also provided a useful resource for new Pfam-A families. Pfam
version 4 has seen a marked change in the way that Pfam-B is
constructed. Up to and including the 3.4 release of Pfam,
Pfam-B was constructed using the Domainer algorithm (9).
The basis for this algorithm was a computationally expensive
all-against-all BLAST comparison of the subsequences not found in
Pfam-A. As a result it became infeasible to re-construct Pfam-B at
every monthly release.

Since the 4.0 release of Pfam, Pfam-B has been constructed
using the ProDom database of protein domain families (10),
which is a high quality automatically generated protein families
database constructed over the same underlying sequence database
as Pfam (SWISS-PROT and TrEMBL). The new construction
process for Pfam-B is fast, and as a result Pfam-B is now re-built
at every monthly point release. Pfam-B in principle is made
from the parts of ProDom not covered by Pfam-A. The Pfam-B
construction process is conceptually a function taking a
ProDom alignment as input and giving between zero and three
Pfam-B families as output. The function is applied to all families
in ProDom to form Pfam-B. In some cases, a ProDom family is
effectively subsumed by one or more Pfam-A families. These
ProDom families are ignored. In other cases, a ProDom family
has no overlap with any Pfam-A family. These alignments
become Pfam-B families with no alteration. More interesting
are cases where the ProDom alignment is truncated or bisected
by a Pfam-A family, as displayed pictorially in Figure 2. In
these cases, the ProDom alignment is cut at the maximal extent
of the intruding Pfam-A family to form one (Fig. 2a), or in the
case of bisection, two or three Pfam-B families (Fig. 2b). Here,
the domain boundaries of Pfam-A are used to infer domain
boundaries for Pfam-B families. New Pfam-B alignments are
only included if they are wider than 20 columns. Cases such as
the 'bisection' example shown in Figure 2 are particularly
useful for Pfam curation. In such cases the ProDom family has
more members than the Pfam-A family it subsumes, and this
implies that perhaps the Pfam-A family is missing some
members. By adding a link from the new Pfam-B families to
the Pfam-A family, this potential deficit is flagged for future
consideration.

Figure 2. Two possible types of overlap between Pfam-A and ProDom
families are shown. (a) A partial overlap that gives one Pfam-B family.
(b) A case where the Pfam-A family is subsumed by a ProDom family to
create three numbered Pfam-B families.
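
The cutting step described above can be pictured as interval subtraction. The sketch below works on a single coordinate range and is a simplification: the real process operates on alignments of many sequences, and the function name, the inclusive coordinate convention and the example ranges are assumptions made here for illustration. Only the "wider than 20 columns" rule is taken from the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive; convention assumed here


def pfam_b_pieces(prodom: Interval, pfam_a: List[Interval],
                  min_width: int = 20) -> List[Interval]:
    """Return the parts of a ProDom region not covered by Pfam-A regions.

    Pieces are kept only if they are wider than min_width columns.
    """
    start, end = prodom
    pieces = []
    cursor = start  # first position not yet assigned to any piece
    for a_start, a_end in sorted(pfam_a):
        if a_end < cursor or a_start > end:
            continue  # this Pfam-A region does not intrude here
        if a_start > cursor:
            pieces.append((cursor, a_start - 1))
        cursor = max(cursor, a_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    return [(s, e) for s, e in pieces if e - s + 1 > min_width]


no_overlap = pfam_b_pieces((1, 300), [])            # kept unaltered
subsumed = pfam_b_pieces((1, 300), [(1, 300)])      # ignored
truncated = pfam_b_pieces((1, 300), [(200, 350)])   # one piece
bisected = pfam_b_pieces((1, 300), [(100, 200)])    # two pieces
```

With two intruding Pfam-A regions the same function returns up to three pieces, which matches the "between zero and three" range given in the text.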



QUALITY CONTROL 

Curating a large number of families presents many challenges
for quality control, both for annotation and family membership.
We have recently added a spell checking functionality to Pfam,
allowing us to store a dictionary of words that are allowed in
the free text lines of Pfam.

Pfam-B is now providing useful quality control for Pfam-A
that was not present before. The comparison of Pfam-A and
ProDom that occurs during Pfam-B construction has provided
Pfam with an excellent way to detect missing members of
families. This has led to large increases in membership for
some families. For example, in Pfam version 4.1 the rieske
domain family (PF00355) had 51 members. This was found to
be related to Pfam-B family 31 by ProDom. By including some of
the related rieske domains from Pfam-B 31 in the seed alignment,
the new Pfam-A profile HMM found 192 rieske domains.



One of the most important quality controls is the overlap
check. This states that no residue of any protein can belong to
more than one family. As new families are added to Pfam, an
overlap to an existing family may signify that the new family is
related to a pre-existing family. In this case we can extend the
existing family to include the members of the new family. The
overlap could also be due to incorrectly choosing domain
boundaries for a family, which can be easily fixed by trimming
the seed alignment. As Pfam's residue coverage increases this
control becomes more stringent and therefore more useful.
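
The overlap rule stated above is easy to express as a check over domain assignments. The following is a minimal sketch assuming a flat list of (protein, start, end, family) records with inclusive coordinates; that layout, the function name and the example records are hypothetical and do not describe the internal Pfam tooling.

```python
def find_overlaps(regions):
    """Report residues claimed by two different families on one protein.

    regions: iterable of (protein id, start, end, family), inclusive coordinates.
    Returns a list of (protein id, family 1, family 2, overlap start, overlap end).
    """
    by_protein = {}
    for protein_id, start, end, family in regions:
        by_protein.setdefault(protein_id, []).append((start, end, family))

    clashes = []
    for protein_id, regs in by_protein.items():
        regs.sort()  # by start position
        for i, (s1, e1, f1) in enumerate(regs):
            for s2, e2, f2 in regs[i + 1:]:
                if s2 > e1:
                    break  # later regions start even further right
                if f1 != f2:
                    clashes.append((protein_id, f1, f2, s2, min(e1, e2)))
    return clashes


# Hypothetical assignments: famA and famB share residues 90-100 of P1.
toy_regions = [
    ("P1", 1, 100, "famA"),
    ("P1", 90, 150, "famB"),
    ("P1", 160, 200, "famC"),
    ("P2", 5, 50, "famA"),
]
clashes = find_overlaps(toy_regions)
```

A reported clash would then be resolved by the curator in one of the two ways the text describes: merging the families or trimming a seed alignment.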

SEARCHING GENOMES WITH Pfam 

An important goal of Pfam is to enable rapid automatic
classification of predicted proteins into protein domain families.
Pfam is used around the world as an aid to genomic annotation
in one of two ways: (i) Pfam can be used to annotate protein
translations using the HMMER software; or (ii) Pfam can be
used to predict genes and annotate genomic DNA using the
Wise2 package.

Although Pfam's coverage across the sequence databases is
high (63%), we know that these databases are biased towards
some protein families and organisms. Therefore it is useful to
know what fraction of protein sequences in whole genome
sequencing projects are annotated by Pfam analysis. Table 1
shows a summary of a Pfam/HMMER analysis of the predicted
proteins from five representative genomes: the bacteria
Escherichia coli and Rickettsia prowazekii, the nematode
Caenorhabditis elegans, the yeast Saccharomyces cerevisiae
and the archaeon Methanococcus jannaschii. Pfam identifies
domains in 40-50% of the proteins in each genome, except for
the archaeal M.jannaschii genome where the fraction is somewhat
lower (33%). This compares favourably to the fraction of
proteins that can be annotated by standard pairwise BLAST
analysis: for example, the worm genome project reported that
~42% of worm proteins had an informative BLAST similarity
to a non-nematode protein (11).

Increasing the number of models in Pfam will increase the
hit rate, of course. However, the expected return on such an
effort is less than one might guess, as illustrated in Table 2. As
a rough rule of thumb, the 10-20 largest protein families can
account for ~10% of each genome. To cover 20%, it takes



Table 1. The fraction of proteins and fraction of residues hit by a Pfam analysis in each of five genomes

                       E.coli     R.prowazekii  C.elegans  S.cerevisiae  M.jannaschii

Total no. proteins     4290       837           16 332     6305          1771
No. proteins hit       2020       421           6344       2542          582
Protein coverage (%)   47         50            39         40            33
Total no. residues     1 363 501  280 233       7 120 11S  2 983 822     501 797
No. residues hit       493 103    105 33S       1 515 030  652 191       128 4A9
Residue coverage (%)   36         38            21         22            26



1664 Pfam 4.2 HMMs were searched [...]. Every protein domain satisfying the [...].
Five genome protein data sets: [...]; M.jannaschii, release 9/29/98, [...].
Table 2. The numbers of Pfam families in five complete genomes

                                  E.coli  R.prowazekii  C.elegans  S.cerevisiae  M.jannaschii

No. families to cover 10%         14      17            9          13            23
No. families to cover 20%         61      59            40         69            98
No. families to cover 30%         151     139           156        224           282
No. Pfam families with a hit      694     337           815        717           339
No. families 'unique' to genome   105                   185        26            8

~50-100 families; to get 30%, it takes ~150-300 families; and
to get our full current coverage in each genome, it takes ~500-1000
families. The representation of a given protein family varies
substantially from genome to genome. The top 10 families that
account for 10% of one genome are not the same as the top 10
families in another genome. For example, the largest bacterial
protein family is the ABC transporter family; in the two
eukaryotes, the protein kinases are the most numerous. The last
line in Table 2 shows the number of Pfam families that show
one or more hits in one genome but no hits in any of the other
genomes, showing that there is substantial non-overlap in the
representation of Pfam families in various genomes. Also, 471
of the 1664 Pfam 4.2 models showed no hits to any proteins in
these five genomes; many of these models cover protein families
specific to vertebrates or viruses.
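
The "number of families to cover N%" rows of Table 2 reflect a simple cumulative count over families ranked by size. A sketch of that calculation follows, under the simplifying assumption that each protein is counted in exactly one family (multi-domain proteins would otherwise be counted more than once); the function name and the family sizes are invented for illustration and are not the data behind Table 2.

```python
def families_to_cover(family_sizes, total_proteins, target_percent):
    """Smallest number of largest families whose members reach the target.

    family_sizes: dict mapping family name -> number of proteins hit
    total_proteins: number of proteins in the genome
    target_percent: integer percentage of the genome to cover
    Returns None if all families together fall short of the target.
    """
    covered = 0
    ranked = sorted(family_sizes.values(), reverse=True)
    for count, size in enumerate(ranked, start=1):
        covered += size
        # Integer arithmetic avoids floating point rounding at the boundary.
        if covered * 100 >= target_percent * total_proteins:
            return count
    return None


# Invented family sizes for a genome of 1000 proteins.
toy_sizes = {"famA": 50, "famB": 30, "famC": 20, "famD": 10}
n_for_10_percent = families_to_cover(toy_sizes, 1000, 10)
```

Because family sizes fall off quickly after the largest few, each additional 10% of coverage needs many more families than the previous 10%, which is the diminishing return the text describes.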

A considerable amount of sequence data is released as raw
genomic sequence. Analysis of this sequence is
hampered by the presence of (i) introns and (ii) frameshifting
sequencing errors in the DNA sequence, which make
deducing the protein sequence of genes contained in the
genomic DNA sequence difficult. It is estimated that ~50% of
metazoan exons are predicted correctly when standard gene
prediction programs are run (T.Hubbard, personal communication). If
Pfam is searched against protein translations of genomic DNA,
in many cases valid protein domains are missed due to the
inaccuracy of gene prediction. The algorithm GeneWise (12)
allows a protein profile HMM to be compared directly to
genomic DNA, without the need for any gene prediction and
allowing for potential frameshifting sequencing errors. GeneWise
contains a gene prediction method which it integrates with the
profile HMM during the comparison. Tests of GeneWise show
that it produces 98% accurate gene predictions in the region of
the homology (R.Guigo, personal communication). Unfortunately
GeneWise is a very CPU expensive program and comparing
100 kb of DNA sequence to the entire Pfam library takes ~30 h
on a Unix server machine (Compaq Alpha).

To allow the large scale application of Pfam to genomic
DNA we used a pre-filter that incorporated a Perl script called
halfwise based on BLASTX (2) to cut the running time down
to an average of 2 h. The BLAST search is of the DNA
sequence against a constructed protein database which
attempts to represent Pfam hits sensibly. This is made by taking
the Pfam full alignments, and making them non-redundant to a
maximum pairwise identity of 75%. This pre-filter is run with
a low threshold to select candidate profile HMMs to be
compared to the DNA sequence using GeneWise. In tests, the
sensitivity loss of using this pre-filter was ~10%, and it also
showed greater robustness towards low complexity regions in
the genomic data, such as unmasked microsatellite repeats.
Halfwise is part of the Wise2 package that provides access to
the GeneWise algorithm in a number of different forms (see
http://www.sanger.ac.uk/Software/Wise2/ ).
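
The non-redundancy step used to build the pre-filter database can be illustrated with a greedy filter over aligned sequences. The text does not say how halfwise computes identity or in what order sequences are considered, so the identity definition below (matches over columns where neither sequence has a gap), the greedy order, the function names and the toy sequences are all assumptions made for this sketch.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def make_non_redundant(aligned, max_identity=75.0):
    """Greedily keep sequences no more than max_identity % identical to any kept one.

    aligned: list of (name, aligned sequence), all of equal alignment length.
    """
    kept = []
    for name, seq in aligned:
        if all(percent_identity(seq, kept_seq) <= max_identity
               for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


# Toy alignment: s2 and s4 are near-duplicates of s1 and are dropped.
toy_alignment = [
    ("s1", "ACDEFGHIKL"),
    ("s2", "ACDEFGHIKV"),
    ("s3", "ACDEWWWWWW"),
    ("s4", "AC--FGHIKL"),
]
representatives = make_non_redundant(toy_alignment)
```

A smaller, less redundant database shortens the BLASTX step while still leaving at least one representative close to each family member, which is the trade-off behind the reported ~10% sensitivity loss.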

AVAILABILITY OF Pfam 

Pfam is available on the WWW in Europe at
http://www.sanger.ac.uk/Software/Pfam/ and http://www.cgr.ki.se/Pfam/ , and
in the US at http://pfam.wustl.edu/ . The Pfam distribution
contains a number of files: Pfam-A.seed and Pfam-A.full
contain the seed and full alignments with annotation in Stockholm
format; Pfam is a file containing the library of Pfam profile
HMMs; PfamFrag is a library of profile HMMs designed
specifically to find matches to protein fragments; SwissPfam is
a file containing the domain organisation for each protein in
the database; Pfam-B contains the data for Pfam-B families in
Stockholm format; diff is a file containing the changes
between releases to allow incremental updates of Pfam derived
data; pfamseq contains the underlying sequence database, in
fasta format, that all sequences in Pfam are taken from.